Learning quadratic receptive fields from neural responses to natural stimuli 
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Models of neural responses to stimuli with complex spatiotemporal correlation structure often 
assume that neurons are only selective for a small number of linear projections of a potentially 
high-dimensional input. Here we explore recent modeling approaches where the neural response 
depends on the quadratic form of the input rather than on its linear projection, that is, the 
neuron is sensitive to the local covariance structure of the signal preceding the spike. To infer 
this quadratic dependence in the presence of arbitrary (e.g. naturalistic) stimulus distribution, 
we review several inference methods, focussing in particular on two information-theory-based 
approaches (maximization of stimulus energy or of noise entropy) and a likelihood-based approach 
(Bayesian spike-triggered covariance, extensions of generalized linear models). We analyze the 
formal connection between the likelihood-based and information-based approaches to show how 
they lead to consistent inference. We demonstrate the practical feasibility of these procedures by 
using model neurons responding to a flickering variance stimulus. 



I. INTRODUCTION 



A basic challenge in sensory neuroscience has been to develop a mathematically concise description of how neurons 
encode stimuli into sequences of spikes. There are two main approaches to this task, which differ primarily in how 
much emphasis is placed on anatomical structure versus function. Structure-based modeling starts at the level where 
basic physical processes govern the observed phenomena. A realistic, conductance-based model could thus be used 
to predict the neuron's response to a particular type of applied stimulation (Hodgkin & Huxley 1952 Koch 1999). 



While this bottom-up approach is directly interpretable in terms of biophysical components and processes, it has a 
number of disadvantages: (i) the required parameters might be experimentally inaccessible; (ii) in a sensory context, 
the inputs in this model (the activity of presynaptic neurons) could be related in a complex and intractable manner 
to the stimulus under experimental control; and, (iii), with enough modeling detail, the problem of understanding or 
summarizing the "computation" that the model implements can become as difficult as understanding the real neuron 
itself. 

Functional models, in contrast to the above, attem pt to capture onl y the essence of the neural computation: the 
transformation of stimuli into spiking responses (see Wu et al. (2006) for an in-depth review). These models are 



usually fully learned from data, rather than being derived from the underlying dynamical or physical model (but 



see 


Agiiera y Areas et al. 


Ostrojic & Brunei 


( 


2011)) 



Ostrojic & Brunei (2011)). Two considerations are therefore critical to the success of functional models: whether 
typical electrophysiological recordings can provide enough data for successful inference of the model's parameters; and 
whether efficient inference algorithms for these parameters exist. Because the space of all possible stimuli (e.g. all 
images incident on a retina) and all possible responses (e.g. complete sets of spike arrival times) is vast, our progress 
must depend on making well-chosen simplifying assumptions. One extreme simplification, for example, involves 
varying the stimulus along one "dimension" only, as in the case of the orientation or wavelength of a drifting grating 
visual stimulus, and representing the output by a single scalar quantity, e.g. the average firing rate in a chosen time 
bin. These measurements have traditionally been summarized by tuning curves and have provided basic insights into 
principles of sensory and population coding (Dayan & Abbott 2001). The relevance of the tuning curve approach is, 



however, limited by the choice of the single dimension along which the stimulus is manipulated, which may drastically 
underestimate the true complexity in the structure of the stimuli to which the neuron could respond. Despite strong 
limitations, such studies helped establish the concept of a "receptive field," the region of stimulus space where changes 
in the stimulus modulate the spiking behavior of the neuron. 

Central to the concept of receptive field is the notion of locality in the stimulus or feature space. For instance, a 
ganglion cell in the retina may be sensitive only to specific changes in light intensity that occur within a small visual 
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FIG. 1 A schematic showing linear-nonlinear architectures, (a) The instantaneous firing rate, or probability per unit 
time of emitting a spike, in a linear-nonlinear model neuron is obtained by passing the signal through a linear filter, and 
mapping the resulting value through a pointwise nonlinearity. (b) A multidimensional LN model neuron requires the signal to 
be filtered through K linear filters. The number of filters, K, is usually much smaller than the dimension of the stimulus. The 
stimulus projections are mapped into the firing rate through a A'-dimensional nonlinear function. Without further assumptions, 
inference of these models is tractable from real recordings only when K is small (usually less than 3) . 



angle (Hartline 1940). A productive way of capturing this notion of locality has been to think of a receptive field as 
one or more filters that act on the stimulus; only those stimulus variations that result in the change in filter output 
have the ability to affect the neural response. In this view, the neurons perform dimensionality reduction by projecting 
the stimulus down into a small number of dimensions. Consequently, the success of data analysis techniques built 
around this idea must depend on whether a small number of filters suffices to fully account for the neuron's sensitivity 
and its response properties. 

Methods based in systems identification theory have provided systematic procedures to infer both the receptive 
fields of neurons as well as subsequent computations. These methods usually share two key features. First, they can 
(sometimes necessarily) be used with stimuli that sample the stimulus space broadly, making no explicit assumptions 
about which stimulus features are important. This is in contrast to the restricted stimuli employed for measuring 
tuning curves. Second, the procedures usually involve a series of approximations that can provably yield an ever better 
description of the system if increasing amounts of data are available. Table [i] provides an overview of various functional 
models and related inference methods. Among the earliest to be used successfully, Wiener and Volterra expansions 
helped identify the first- and second-order kernels mapping the stimulus to response time traces in various systems 



(Marmarelis & Marmarelis 


Wiener 


1958 


). However, in 



, 2005 Sakai 


1992 


Schetzen 


1989 



Wiener 1958). However, in many cases strong intrinsic nonlinearities attributed to spike generation would require a 



large number of terms in Wiener- Volterra expansions, despite the fact that the underlying stimulus sensitivity might 
be simpler, and therefore of low order. Models where the (possibly linear) projections of the stimulus in the receptive 
field were decoupled from the nonlinearities underlying spiking, as in linear-nonlinear (LN) architectures illustrated 
in Fig. [I] made further progress possible. 

LN and LN-like models have been used widely and profitably to predict the firing rate traces of single sensory 
neurons, because their parameters can be inferred easily under suitable conditions. However, the more intriguing 
cases are the ones where LN models either perform poorly or fail entirely. One such failure mode is the inability 
to account for the statistics of neural activity beyond the mean firing rate. Specifically, real sensory neurons often 
have variability that is smaller than that attributed to Poisson processes ( |de Ruyter van Steveninck et al. 19971; 



phenomena like refractoriness and spike rate adaptation are not captured by LN models (Berry & Meister 



1998): 



and in neural populations, uncoupled LN models fail to reproduce the basic covariance structure of neural activity 
( Granot- Atedgi et al. 2012 Pillow et al. 2008 Schneidman et al. 2006). Some of these issues can be addressed by 
adding suitable dynamical complexity beyond the linear filtering stage, to make the nonlinearities in spike generation 
more realistic (Keat et al. 2001 Ozuysal & Baccus 20121 Paninski et al. 2004), or by including interactions between 



neurons in models of neural firing ( Granot- Atedgi et al. 12012 Pillow et al. 2008) 
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A different kind of failure of LN models rests on the assumption that stimulus sensitivity occurs through a single 
(or a small number of) linear projections. One example is contrast adaptation, where a simple LN model derived 
from a white noise stimulus of a certain variance fails to accurately predict the response to a stimulus with smaller 
or larger variance (|Baccus k, Meisterj 2002| |Borst fe Egelhaaf 1987 van Hateren 1992 de Ruyter van Steveninck et 



al. 



1986] [Smirnakis et al.| 



1997|). Other examples include the failure to account for the sensitivity of retinal ganglion 

~ 2001) ), 



cells to fine spatial detail (possibly because of nonlinear summation within the receptive field ( Demb et al. 



to stimulus motion (Berry et al. 1999 Gollisch & Meister 2010 Schwartz et al. 2007). Generally, these difficulties 



or 



emerge clearly when the stimulus statistics change or increase in complexity beyond those used to infer the model, 
for instance by becoming more "naturalistic"- i.e. having temporal and spatial pairwise correlation over many scales, 
skewed first order histograms, and statistical structure beyond second order. 

The problems with the LN models can generally be addressed in two possible ways. In the first, LN models can 
be extended to account for a particular phenomenon on a particular stimulus, e.g., by adding a contrast gain control 



mechanism (Schwartz & Simoncelli 2001 Schwartz et al. 2002) or by an ad hoc rescaling of nonlincarities (Brenner 



et al. 2000[ ) to account for contrast adaptation in an experiment where the variance of a Gaussian input is modulated. 



The second approach is more general: by using complex stimuli, including fully natural movies from the start, the 
goal is to find the complete (or close to complete) set of features to which the neuron responds. It is worth noting 
that these two approaches, as well as the associated use of simple vs natural stimulus ensembles, generally reflect 
two motivations for building models of neural encoding in the first place: one is to propose and test specific simple 
models and incrementally improve them, while the other is to infer descriptions that should be valid across a wide 
range of stimuli and conditions from the start. For the former purpose — falsifying a model or developing a simple 
functional form for the stimulus-response relationship — using a stimulus set that is analytically convenient but highly 
un-natural, e.g., white noise, is sufficient. This is because when a proposed model fails on a subset of stimuli, it can be 
excluded or must be extended by additional mechanisms. Until recently, this was the main reason for using systems 
identification methods with noise stimuli. The drive to use naturalistic stimuli comes, on the other hand, from trying 
to find a model that captures from the start the responses to a wide variety of biologically relevant inputs, and from 



the observation that naturalistic stimuli may change even the basic filter responses of cells ( Sharpee et al. 2006 ) and 



engage response mechanisms that are difficult to probe using noise stimulation (e.g., Olveczky et al. ( |2007 l). Potential 
drawbacks with using natural stimuli include technical obstacles in model inference and the statistical intractability 



of the natural ensemble (Geisler 2008 Simoncelli & Olshausen 2001 ). The choice of the stimulus ensemble certainly 



deserves a lengthier discussion; see, for example, Rust & Movshon (2005). 

To find the complete set of stimulus features to which a neuron responds, one can look for multiple linear fea- 
tures, a task for which methodological frameworks exist and have been validated for a small number of features. 
Unfortunately, extracting more than 2 or 3 features becomes intractable because of the curse of dimensionality. A 
possible anatomically motivated simplification of a multi-feature LN model is a cascade LN (an LNLN) model, where 
the nonlinearly transformed filter outputs are linearly summed and passed through the spike-generating nonlinearity. 
Despite some successes (Bolinger & Gollisch 2012 Gollisch & Herz 2005), the general problem of inferring cascading 
models remains technically challenging (usually involving difficult optimizations). A somewhat simpler LNL system 
has proven both to account for the behavior of the Y-type retinal ganglion cells very well, as well as being tractable to 



infer using the sum-of-sinusoids formulation of the Wiener formalism (Victor & Knight 1979[ Victor & Shapley 1979 



1980 ) . A particular case of interest for this review is a special subclass of LNLN models which can be reformulated 
as quadratic-nonlinear models, i.e. models where the initial dimensionality reduction of the stimulus is not a linear 
projection of the stimulus, but rather an arbitrary quadratic function of the stimulus. 

Recently there has been a lot of interest in designing systematic, tractable methods for inferring neural sensitivities 
when the initial dimensionality reduction step is of high-order (e.g. quadratic) 1 . In this paper, we start by presenting 
several biologically motivated examples of quadratic stimulus sensitivity in Section [TXJ We then review several com- 
plementary approaches that can be used to learn quadratic stimulus dependence even when neurons are responding 
to rich, naturalistic stimuli: we discuss the maximally informative stimulus energy (Rajan & Bialek 2012) and the 
maximization of noise entropy ( |Fitzgerald et al.| |2011a|b| Globerson et al. 2009[ ) in Section III.A[ and follow with 
the Bayesian spike-triggered covariance (|P ark fe Pillow||2011 ) and related extensions of generalized linear models to 



quadratic stimulus dependence in Section III.C We show under which conditions information and likelihood based 



approaches lead to consistent inference in the Appendix. 



When we speak of the order (e.g. linear, quadratic etc), we refer to the order of the kernel operating on the stimulus, which can be 
defined unambiguously. In contrast, the order of the neural processing system as a whole depends on the stimulus statistics; for example, 
higher-order statistical structure in the stimulus can conflate first- and second-order responses of the system. Likewise, aspects of the 
response explained by second order kernel inferred even with Gaussian noise depend on the po wer spectrum of the i nput. 



This problem has been worked on by the authors of this review in parallel with the authors of 



Park & Pillow 



(2011). 
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II. HIGH-ORDER STIMULUS DEPENDENCE 



In a typical experiment, a neuron can be driven by a synthetically generated stimulus containing a desired statistical 
structure. For probing the visual system for example, this stimulus might be a random binary checkerboard, a drifting 
grating, or full-field light intensity flicker. If the neuron's response depends solely on the stimulus presented in the 
recent past of duration T (and possibly on its own previous spiking behavior), we can restrict our attention to 
stimulus clips s of length > T. These clips are drawn from a distribution P(s) that characterizes the stimulus; the N 
components of vector s represent successive stimulus values in time and optionally across space. Our task is then to 
infer the dependence of the instantaneous probability of spiking (firing rate) at time t on the stimulus, s(i), presented 
just prior to t. 

If the neuron is well described by the linear-nonlinear (LN) model, where the spiking rate r is an arbitrary positive, 
point-wise, nonlinear function / of the stimulus projected onto the filter, r(s) = /(k-s), and the stimulus distribution is 
chosen to be spherically symmetric, P(s) = P(|s|), we can use the spike-triggered average (STA) to obtain an unbiased 
estimate of the single linear filter k (de Boer & Kuyper 1968 Simoncelli et al.| 2004). Spike-triggered covariance 



(STC) generalizes the filter inference to cases where the firing rate depends nonlinearly on K > 1 projections of 
the stimulus, r(s) = /(ki • s, k2 ■ s, . . . , kx • s) ( |de Ruyter van Steveninck &: Bialek| |1988[ ). The number of relevant 
linear filters, K, is equal to the number of nonzero eigenvalues of the spike-triggered covariance matrix. A successful 
application of STC requires -P(s) to be Gaussian, and the number of filters K be small (usually < 3) to ensure an 
adequate sampling of the filters and the nonlinearity /, given the data obtained in the typical experiment (however 
when inferring only the linear part of such models as many as 14 filters have been estimated (Rust et al.| 2004)). 



STC has been used successfully, for example, to understand the computations performed by motion sensitive neurons 
in the blowfly (Bialek & de Ruyter van Steveninck 2005), to map out the sensitivity to full-field flickering stimuli in 

~ 2004 " 



salamander retinal ganglion cells (Fair hall et al. 20061, to explore contrast gain control (Rust et al 



et al. 2002), and to understand adaptation in the rodent barrel cortex (Maravall et al. 2007) 



Schwartz 



Before moving on, it seems appropriate to return once more to the Wiener formalism and contrast it with spike- 
triggered methods for recovering LN models. The underlying assumptions of the two approaches may seem substan- 
tially different: first, because of the presence of the nonlinear (N) transformation in the LN model, and second, because 
the output of the LN model is usually taken to predict the rate of a stochastic point process, while Wiener series is 

Nevertheless, it is easy to see that when uncorrelated 



1958) 



intended for analyzing deterministic systems (Wiener 
(i.e. white) Gaussian noise is used to extract the filters of the LN model using spike triggered average (STA) and 
spike triggered covariance (STC), STA and STC also provide unbiased estimates (up to a scaling factor) of first- and 
second-order Wiener kernels. The difference arises in subsequent analysis steps: in case of LN models, STA and STC 
are used solely as dimensionality reduction steps to identify the relevant subspace of the stimuli in which the nonlinear 
transformation acts, while in the Wiener formalism, STA and STC literally are the first two terms in a functional 
expansion that provides the best least-squares fit to the observed firing rate. Victor & Johannesma ( 1986 ) have further 



demonstrated that the Wiener formalism is a special case of a general probabilistic maximum entropy framework for 
describing joint distributions of stimuli and responses. In this framework, for example, the classic Wiener formalism 
is recovered if the stimulus distribution is Gaussian, and the response variable is also Gaussian with additive noise. 
If, on the other hand, the output variable is binary (spike / no-spike), the same maximum entropy approach reduces 
to identifying LN-type models with exponential nonlinearities. 

While powerful and simple to use, spike-triggered covariance (STC) only works if Gaussian stimuli are employed, 
and is feasible only if K is small. The Gaussian ensemble can be a serious restriction for neurons that do not 
respond well (or at all) to unstructured stimuli; furthermore, we are likely to miss several neural mechanisms that 
depend on naturalistic statistical structure, such as correlations, intermittency etc, if the neuron responds to Gaussian 
stimulation. A versatile method should therefore be able to successfully infer the multiple-filter dependence of a neuron 



probed with a stimulus of arbitrary complexity. Maximally informative dimensions ( MID) (|Sharpee et al. 2004 ) or 



likelihood inference for single-filter generalized linear models ( Gerwinn et al.| 2010| Paninski 2004 Pillow 2007 



Pillow et al. 2008 Truccolo et al. 20041 have been used to this end when the dependence is linear, but the attempts 



to incorporate full quadratic stimulus dependence have been less common. 

There are several instances of quadratic stimulus dependence. Let us consider a situation where the neuron has 
a vanishing spike-triggered average, as with a complex cell, non-phase-locked auditory neurons ( Recio-Spinoso et 
al. 20051, or motion-sensitive neurons. In these cases a natural starting point would be a search for more than a 
single linear filter. For a model complex cell in the visual cortex, we would find two phase-shifted vectors ki and k2 
that together form a quadrature pair, such that the most informative variable concerning the neuron's firing is the 
"power," 



^(s) = /[(k 1 .s) 2 + (k 2 .s) 2 ] 



(1) 



Similarly, models of contrast gain control in the retina also include sensitivity to second-order features in the 
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(a) Flickering variance stimulus (b) Linear filters for different C (c) Nonlinearities for different C 




Local contrast Time relative to spike, s Filter *stim 

FIG. 2 A synthetic contrast-adapting neuron probed with the "nickering variance" stimulus. The instantaneous 
spiking rate is given by r(t) = /(ko ■ s(t) +s(f) T Qs(t) + /j,), where /(•) = log(l + exp(-)), fi is an offset (bias), and the quadratic 
kernel Q is a rank 2 matrix with a quadrature eigenvector pair, (a) The stimulus is sampled at A = 1 ms scale and is given by 
s(t) = exp(A(t))w(t), where w(t) is given by uncorrelated white noise of fixed variance, and A(t) is a gaussian noise process with 
correlation time r c = 1 s. The stimulus can be chopped into segments of duration r < t c , which can be sorted by local contrast 
C (intensity of red). Spike-triggered average analysis can be applied to recover effective LN models for all stimulus segments 
sharing the same local contrast, (b) The linear filters recovered at various contrast levels (shade of red; filters displaced along 
vertical axis for easier readability). At lower contrasts the neuron produces less spikes, making the filter estimate more noisy, 
but the filter shape is constant across a range of C and closely approximates the model filter ko. (c) The nonlinearities for 
different contrast levels C (plot legend, shades of red; the nonlinearities displaced along vertical axis for easier readability). 
The slope of the nonlinearity decreases with increasing contrast (although the adaptation is not perfect, in this example), to 
prevent quick saturation of the response at high C. 



stimulus, with the spiking probability of the form (Schwartz et al. 2002 1, 



r(s) 



/(ko • s) 



(2) 



where the quadratic terms in the denominator scale down the gain at high contrast (in this case however, the neuron 
has a non-vanishing linear filter ko). A simulated model neuron showing contrast adaptation is shown in Fig. [2^,, 
featuring both the first- and second-order stimulus sensitivity. The model neuron is probed with a "flickering variance" 
stimulus, in which the variance of the white noise (with a very short correlation time) is dynamically modulated by 
a noise process correlated across a longer timescale (c.f. Fair hall et al.| (2001)). With this synthetic stimulus, the 



separation of timescales allows us to partition the stimulus into chunks with approximately constant variance in 
luminance, <j\. This variance is directly related to the temporal contrast, C — a^/L, because the average mean light 
intensity L is kept constant. Within each stimulus segment, we can use STA to recover the LN model, as shown in 
Figs. [2]d,c. Our real goal, however, is to infer a joint model valid across the whole stimulus, and to do so ultimately 
with naturalistic stimuli with scale-free power spectra, where no clear separation exists between the fast fluctuation 
and slow variance modulation. 

We can describe these and similar examples by a generic "quadratic" model neuron which is sensitive to a second- 
order function of the input (parametrized by a real, symmetric matrix Q) in addition to the linear projection 
(parametrized by the filter kg): 



(s) = /(ko-s, s T Qs). 



(3) 



Graphically, while a threshold LN model with a linear filter corresponds to a classifier whose separating hyperplane is 
perpendicular to the filter, the proposed LN model with a threshold nonlinearity and a quadratic filter Q is selective 
for all stimuli that lie outside an A^-dimensional ellipsoid whose axes correspond to the eigenvectors of Q. 

For the contrast gain control model described in Eq ^ the matrix Q is of rank M, with eigenvalues Wi and 



eigenvectors kj,i > 0. The complex cell example described in Eq M has k = and Q = 2~Zj=i kjk? ; in other 
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words, Q is a rank 2 matrix. While these examples feature quadratic dependences involving matrices of low rank, 
it is possible to extend these models to biologically relevant cases where the matrix does not have to be low rank 
(Rajan & Bialek 2012). For example, the probability of spiking could be a nonlinear function of the "power" pit), 

m = fw 



where the power is given by: 



pit) = / dTf 2 {r) 



dt'hit 



t')s{t') 



(4) 



here s(t) is the stimulus, and /i and /2 are linear filters, such as those used to describe non-phase-locked auditory 
neurons. If the smoothing time of the second filter is larger than that of the first filter /i, it has been shown in 



(Rajan & Bialek 2012) that the quadratic kernel Q for this model has a rich (full-rank) spectrum. 



In the next section we review methods that permit inference of low- or full- rank quadratic kernels, Q. 



III. INFERRING QUADRATIC STIMULUS DEPENDENCE FROM DATA 

Every real, symmetric matrix can be spectrally decomposed into Q = 2j=i AjkikJ. The response of the quadratic 
model is thus r = f Yli=i ^iO^i ' s ) 2 > explicitly demonstrating that quadratic models are special cases of the LNLN 
cascade, where the first linear stage involves applying the filters k^, the first nonlinear stage squares the projections, 
the second linear stage is a summation with weights A,;, and the last nonlinear transformation is /(•)■ The spec- 
tral decomposition implies that we could try recovering the quadratic dependence of Q in Eq (J3j) by, for example, 
multidimensional MID (see Table [I]), hoping to infer all {k^} as orthogonal informative dimensions. While formally 
true, this is infeasible in practice because maximizing the mutual information would involve sampling TV-dimensional 



distributions from stimulus samples that arc limited in number by the number of spikes (Sharpee et al. 2004). The 
same sampling problem would reappear when trying to estimate the nonlinearity, /(ki • s, k2 • s, . . . , k/v • sj! 

To address this problem efficiently, we formulate the inference problem by explicitly assuming quadratic dependence 
on the stimulus: in this case, the stimulus immediately gets projected down to a single scalar variable x = s T Qs, 
meaning that information-theoretic quantities, the likelihood, as well as the nonlinearity /(•) will only depend on the 
stimulus through x. This makes inference problem tractable even when Q is of high rank. Clearly, this advantage is 
gained by assuming that projections onto eigenvectors of Q combine as a sum of squares. This assumption is not a 
mere mathematical convenience: as we have shown previously, well-known phenomena of phase invariance, adaptation 
to local contrast or sensitivity to the signal envelope are all examples of true second-order stimulus sensitivity in real 
neurons. Additionally, response phenomena in the visual cortex grouped together as relating to the non-classical 



receptive field could also be manifestations of quadratic or higher-order sensitivity ( Zetzsche & Nuding 2005 ) 



A. Finding quadratic filters using information maximization 



Despite their utility and simplicity, spike-triggered methods require the use of statistically simple stimuli and in 
particular, exclude the use of stimuli with naturalistic statistics, e.g. those with 1// spectra, non-Gaussian histograms 
and/or high-order correlations. This is a big challenge when studying neurons beyond the sensory periphery that are 
responsible for extracting high-order structure, or neurons unresponsive to white noise presentations, for example 
those in the auditory pathway. To address this issue and recover the filter(s) in an unbiased manner with an arbitrary 



stimulus distribution, maximally informative dimensions (MID) (Kouh & Sharpee 2009 Sharpee et al. 2004 2006) 



have been developed and utilized to recover simple cell receptive fields, among other examples. MID looks for a linear 
filter k that maximizes the information between the presence/absence of a spike and the projection x of the stimulus 
onto k, x = k • s. Information per spike is then given by the Kullback-Leibler divergence of P(a;|spike), the spike- 
triggered distribution (the distribution of stimulus projections preceding the spike) and P(x), the prior distribution 
(the overall distribution of projections): 



^spikc = Dkl [P(x|spike)||P(x) 



dx P(x|spike) log 2 



P(x|spike) 

W) 



(5) 



Given the spike train and the stimulus, finding k becomes an information optimization problem in / sp ik e that can be 
solved using various annealing methods, although the existence of local extrema makes this a nontrivial task. 

Spike-triggered methods and MID do not explicitly assume a form for the nonlinearity /(•) in the LN model; instead, 
they provide unbiased estimates of the filter(s), and once the filters arc known, the nonlinearity can be reconstructed 



using the Bayes' rule from sampled spike-triggered and prior distributions: 

s , x P(a;|spike)P(spike) 

f(x) ex P(spike|z) = V ' . (6) 

where P(spike) is directly proportional to the average firing rate during the experiment. 

In classical MID, one finds a (set of) linear filter(s) by maximizing Eq. ^ with respect to k. In Rajan & Bialek 



(2012), this approach was extended to quadratic stimulus sensitivity, as follows. A quadratic filter Q can be recon- 
structed from an observed spike train by maximizing the information in Eq (|5j), where x is now given by x — s T Qs. 



Taking a derivative of Eq ^ with respect to Q gives us a gradient, 



V Q / = fdx P Q (x) [<ss T | ,,s P ike) - <ss T | x)] ± ( ^gg^ ) 



(7) 



where ( • ) indicates averaging over the spike-triggered and prior distributions respectively, and the subscript Q makes 
the dependence of the probability distributions explicit. Only the symmetric part of Q contributes to x, and the 
overall scale of the matrix is irrelevant to the information, making the number of free parameters N(N + l)/2 — 1. 
To learn the "Maximally Informative Stimulus Energy" or the quadratic filter Q, we can ascend the gradient in 



successive learning steps (Rajan & Bialek 2012), 



Q^Q + 7V Q 7. (8) 

The probability distributions within the gradient are obtained by computing x for all stimuli, choosing an appro- 
priate binning for the variable x, and sampling binned versions of the spike-t rigge red and prior distributions. The 
(ss T ) averages are computed separately for each bin; and the integral in Eqs (5|7| and the derivative in Eq ffi are 



approximated as a sum over bins and as a finite difference, respectively. To deal with local maxima in the objective 
function, we use a large starting value of 7 and gradually decrease 7 during learning. This basic algorithm can be 
extended by using kernel density estimation and stochastic gradient ascent /annealing methods, but we do not report 
these technical improvements here. 

It is possible to select an approximate linear basis in which to expand the matrix Q, by writing 



M 

Q = 5> M B^. (9) 



The basis can be chosen so that increasing the number of basis components M allows the reconstruction of progressively 
finer features in Q. We considered as {B' M ^} a family of Gaussian bumps that tile the space of the N x N matrix 
Q and whose scale (standard deviation) is inversely proportional to vM. For M — > N 2 /2 the matrix set becomes 
a complete basis, allowing every Q to be exactly represented by the vector of coefficients a. In such a matrix basis 
representation, the learning rule becomes 

N BT 

a ^% + ^EaQ- B ?- ( 10 ) 

i,j=l ^ tJ 

where applying the chain rule on Vq/ yields the Trace[VQ(a) ■ B] update term at each step. 

We illustrate this approach with two examples. In the first example we make use of the matrix basis expansion 
from Eq ^ to infer a quadratic kernel K that is of arbitrarily high rank. For K we used a highly-structured 500 x 500 
matrix as shown in Fig. [3ja). While this is not an example of a receptive field from a real neuron, it illustrates the 
validity of the approach even when the response has an atypical and highly structured dependence on the stimulus. 
The stimuli were natural image clips from the Penn Natural Image database, flattened into a high-dimensional vector 



representation s ( |Tkacik et al.[ 2011 1, and the spikes were generated by thresholding the term s T Ks. Gaussian basis 
matrices, similar to the 225 shown in Fig. |3jb) were used to expand the quadratic kernel, reducing the number of 
optimization parameters from ~ 2.5 x 10 5 to a few hundred. We start the gradient ascent with a large 7 value of 1 
and progressively scale it down to 0.1 near the end of the algorithm; Fig. [3^e) shows the information plateauing in 
about 20 learning steps. The maximally informative quadratic filter reconstructed from 400 basis coefficients is shown 
in Fig. [3^d) . Figure [3jc) demonstrates how the root-mean-squared reconstruction error systematically decreases as 
the number of basis functions M is increased from 4 to 400, improving precision. Insets show 2 inferred matrices with 
M = 100 (corresponding to the first dot) showing a marked improvement with M — 225 (corresponding to the second 
red dot). Reconstruction error drops to ~ 1% for M — 400. 
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FIG. 3 Reconstructing a high rank quadratic filter using stimuli extracted from natural scenes, (a) A complex 
high-rank randomly generated matrix K will be used as a quadratic filter of a model cell that fires whenever s T Ks exceeds a 
fixed threshold. K is thus the true quadratic filter for our threshold LN model neuron, (b) A collection of 225 Gaussian matrix 
basis functions whose peaks densely tile the matrix space; a trial matrix is constructed as a linear sum (with coefficients {cv}) 
of the basis matrices, and information optimization is performed over {a M }. (c) The normalized reconstruction error, shown 
in black dots, decreases as the number of basis functions M increases from 4 to 400; with enough data perfect reconstruction 
is possible as M approaches the number of independent pixels in K. The two red dots show reconstructions with M = 100 or 
M — 225 basis functions, respectively, (d) The reconstructed, maximally informative matrix kernel Q after maximizing mutual 
information using 400 basis functions, (e) Mutual information increases as learning progresses in steps given by Eq. (j^Sj) , peaks 
at step 40 and remains unchanged thereafter. Learning step 100 is the point where the maximally informative Q is extracted. 



In contrast to standard MID where the number of spikes required grows exponentially in the number of filters 
extracted, the data requirement for this approach is proportional to the square of the stimulus dimension for a 
matrix kernel with no additional structural simplifications (these data requirement- and performance-related issues 
are explored in detail in Rajan & Bialek (2012)). For the examples shown in the paper, expansion in matrix basis 
reduces this number to the order of stimulus dimension, making this procedure pertinent for experimentalists. 

The second example shows the MISE analysis of the synthetic neuron presented in Fig. [2] where stimulus-response 
relationship is more biologically realistic, through a smooth nonlinear function / and both a linear as well as a 
quadratic kernel. The analysis is applied to the flickering variance stimulus without partitioning it into regions of 
fixed contrast. With ~ 2 x 10 4 spikes, the method recovers the linear filter ko as well as the quadratic kernel, 
which turns out to have the two dominant eigenvectors k 1; k 2 , corresponding to the quadrature pair of filters used to 
construct Q, as shown in Fig. [4]d. 

These examples show that quadratic filters can be extracted using information maximization for both low-rank and 
full-rank matrices, under natural stimulation and with a realistic numbers of spikes. Importantly, for cases where 
the stimulus sensitivity is both linear and quadratic, MISE does not explicitly assume that the effects of two filtering 
operations are additive, i.e. that x = ko • s + s T Qs; rather, the dependence can be an arbitrary 2D nonlinear function, 
/(ko • s, s T Qs). Unlike the quadratic generalizations of GLM presented below, this allows MISE to fully recover forms 
of contrast gain control that have a parametric form similar to Eq. ([2]). 



B. Finding quadratic filters using maximization of noise entropy 



Another information-theoretic approach for inferring single neuron sensitivities is derived from the principle of noise 
entropy maximization ( Fitzgerald et al. 2011a|b Globerson et al. 2009 ) . Suppose that the spiking or silence of a 
chosen neuron in a time bin indexed by t is represented by a binary variable yt £ {0, 1}. From data, we can reliably 
estimate certain statistics of the neural response, such as the average spiking rate (yt)t, the spike-triggered average 
(y t s(t)) t , or the spike-triggered covariance (j/ t s(i)s(i) T ) t , where the brackets ( • ) t denote averaging across the duration 
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of the experiment. In general, all these statistics are of the form (0^(s)y t )t, where indexes the different operators 
whose expectation values we are computing. 

The crucial step is to look for maximum entropy approximations to P(y\s), the distribution of the (binary) neural 
response given the stimulus. Maximum entropy distributions are as unstructured (random, therefore parsimonious) as 
possible with the constraint that they exactly reproduce the measured expectation values of a chosen set of statistics, 
{Ofj,} ( |Jaynes 1957a|b ). When the variable y is binary, it can easily be shown that these distributions have the form 
of the logistic function, 

P(y = Ms)= 1 + e - fW ' < n > 

where F resembles the free energy in statistical physics: 

F(s) = ^A m O m (s), (12) 

and A M are the Lagrange multipliers that have to be set such that the set of statistics measured in the data equals 
the expectation values of the same operators under distributio n P, i.e. (Q^{s)y)p = (0^ (s)y) t . To apply this general 



framework to the inference of quadratic filters, the authors of Fitzgerald et al. (2011b) choose the mean firing rate, 



STA and STC as constraints, which yields the following response distribution: 

P(y=l|s) = , 1 ^— T , (13) 

VJ 1 1 1 + cxpQu + k ■ s + s T Qs) ' v ; 

where {fi, k ,Q} act as the Lagrange multipliers A M conjugated to the operators {y, ys, yss T }. Numerically, the task 
is to solve for parameters {/x, k , Q} that satisfy a set of constraints: (y) t = (y}p (matching the measured mean firing 
rate to that of the model), (ys) t — (ys)p (matching the measured STA to that of the model), and (yss T ) t = (yss T )p 
(matching the measured STC to that of the model). This is a convex optimization task and can be solved by conjugate 
gradient descent. 

An attractive feature of this approach emerges when we rewrite the information per spike /(spike; s) as a difference 
between the total and the noise entropy as follows: 

/(spike; s) = p (s) E P ^ lo ^ = S PW] - (S[P(y\s)]) s , (14) 

where S[P(x)] = —'}2 x P{x)\og2P{ x ) 1S the entropy of P(x). The first term (total entropy) is fully determined by 
the mean spiking rate (y) t , S[P(y)] = — (y)t log 2 (y)t — (1 — (y)t)log 2 (l — (y) t ) because y is a binary variable. The 
mean firing rate is one of the statistics constrained in the model for P(y\s), ensuring consistency. Since our model for 
P(y\s) has maximum entropy given the observed constraints, we are effectively setting an upper bound on the noise 
entropy (5[P(j/|s)]) s , and therefore a lower bound on the mutual information /. As more and more statistics 0(s) 
are included as constraints into the maximum entropy model for Eq. ( |11[ ), the noise entropy must progressively drop 
and information increase towards the true value (which is bounded by the output entropy). At the point where this 
lower bound on information meets the actual information per spike (which can be empirically estimated from, e.g., 
repeated stimulation ( jBrenner et al. 2000)), we obtain the complete list of the relevant stimulus statistics {O^} that 



characterize the sensitivity of the neuron 



In Fitzgerald et al. (2011b), the authors show that this framework is applicable for inferring quadratic neural filters 
on synthetic and real data, and compare it to MID. This method is applicable to any stimulus ensemble, but requires 
assumptions beyond those needed for MID or MISE: namely, that the nonlinear function is logistic, and that the 
contributions of the linear and quadratic filters add. The advantage of the method is that the problem is convex, 
does not suffer from the exponential curse of dimensionality (like multi-dimensional MID), and is flexible, permitting 
various new constraints (beyond the STA and STC) to be used in constructing models for the stimulus-conditional 
distribution P(y\s). 



C. Finding quadratic filters in a likelihood framework: GLM extensions and Bayesian STC 



A powerful technique for modeling neural spiking behavior is the generalized linear model (GLM) framework 

Recently GLM has been used to account for the stimulus sensitivity, dependence 



Paninski 


2004 


Truccolo et al. 


2004) 



on spiking history, and connectivity in a population of 27 retinal ganglion cells in the macaque retina ( Pillow et al. 
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2008). For a single neuron, the model assumes that the instantaneous spiking rate r(t) is a nonlinear function / of a 



sum of contributions, 



r(t) = /[k-s(t) + q-y(t_)+M] 



(15) 



where k is a linear filter acting on the stimulus s, q is a linear filter acting on the spiking history y(t_) of the neuron, 
and fi is an offset or an intrinsic bias towards spiking or silence. When the stimulus and the spike train are discretized 
into time bins of duration A, the probability of observing (an integral number of) y t spikes is Poisson, with a mean 
given by r t A (where the subscript indexes the time bin). Here, we neglect the history dependence of the spikes (with 
no loss of generality) and focus instead on the stimulus dependence; since each time bin is conditionally independent 
given the stimulus (and past spiking), the log likelihood for any spike train {y t } is (Pillow 2007): 



logP({y t }|s) = ^y t logr t - A^r t + c, 



(16) 



where c is independent of both /i and k. This likelihood can be maximized with respect to /i and k (and optionally, with 
respect to g) given adequate number of spikes, providing an estimate of the filters from neural responses to complex, 
even natural stimuli. In contrast to maximally informative approaches, such as the stimulus energy derived in Section 
III. A (Rajan & Bialek 20121, the functional form of the nonlinearity / is an explicit assumption in likelihood-based 
methods like GLM. For specific classes of the function /, such as f(z) — log[l + exp(z)], exp(z) or [1 + exp(z)] -1 , the 
likelihood optimization problem is convex and gradient ascent is guaranteed to find a unique global maximum. 

While the tractability consequent to convexity of the objective function is a big strength of this approach, the 
disadvantage is that if the chosen nonlinearity / is different from the true function /' used by the neuron, the filters 
inferred by maximizing likelihood in Eq ( 16 1 could be biased. If we relax the stringent requirement for convexity, 



we can choose more general nonlinear functions for the model, for example by parametrizing the nonlinearity in a 
point-wise fashion and inferring it jointly with the filters. For this discussion however, we assume that / has been 
selected from the specific class of nonlinearities guaranteed to yield a convex likelihood function. 

How can we extend GLM to situations where the neuron's response is more complex than a single linear projection 
of the stimulus? We will start with a proposal and follow up with a closely related formulation of Park fc Pillow| 
(2011) developed in parallel, which has provided a more complete analysis and several interesting extensions. One 
possibility is to expand the stimulus clip s of dimension N into a larger space first, for instance by forming ss T (of 
dimension N x N), and then operate on this object with a filter, i.e., Yl% j=i{ a i s j)Qij • Such a term can be added to 



the argument of / in the model exemplified in Eq (15). Specifically, we propose a "Generalized Quadratic Model" of 
the following form, 



-(*) = / [k • s(t) + s T (i)Qs(t) + g • y(i_) + /i] 



(17) 



If we want to retain convexity, we cannot expand Q in its eigenbasis and infer its vectors by maximizing the likelihood 
directly, because the eigenvectors appear quadratically. However, we can expand Q into a weighted sum of matrix 
basis functions, as in Eq (pi), making the argument of / a linear function of basis coefficients a M , 



/ M 

r(t) = / k • s(t) + ]T [s T (t)BM S (t)] <*„ + g • y(t_) 



(18) 



Existing methods for inferring GLM parameters (Pillow et al. 20081 can be used to learn both the linear filter and 



the quadratic filter Q efficiently. After extracting Q we can check if a few principal components account for most of 
its structure (this is equivalent to checking whether Q is indeed a low rank matrix). In sum, this procedure provides 
a way of extracting multiple filters with GLM that is analogous to diagonalizing the spike-triggered covariance matrix 
on the Gaussian stimulus ensemble. 

We have implemented such a quadratic extension to the GLM and applied it to the flickering variance stimulus 
shown in Fig. [2] The results are shown in Fig. [3Jt. The quadratic kernel correctly recovers a quadrature pair of filters; 
we similarly recover the correct linear filter k . While this method is restricted to a linear combination of first- and 
second-order filters within the nonlinearity, the distinct advantage over MISE is that the inference problem is convex 
with the appropriate nonlinearity. 



notation of this paper) 



Park & Pillow (2011) consider an exponentiated general quadratic function of the following form (rewritten in the 



r(s) = exp (s T Qs + k • s + /i) 



(19) 
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FIG. 4 Recovering the synthetic model of the contrast gain control cell using the flickering variance stimulus. 

The spikes were simulated using the model presented in Fig. [5] (a) The true quadratic kernel, Q, of the model, is a matrix 
of rank 2 with the two niters combining into quadrature to estimate the signal "power" or variance, (b) The reconstructed 
kernel using the quadratic extension of the GLM; the space of matrices was spanned by a 85-dimensional basis of Gaussian 
bumps (some of the granularity can still be seen in the reconstruction). The dominant eigenvectors of the inferred matrix are 
shown in (c) in blue and green (solid black lines show the true values); shown is also the recovered linear filter (red circles) 
and its true value (solid black line). The inference of the same model using MISE shows quick convergence in (d) and the 
recovered quadratic kernel in (e). (f) The linear filter and the eigenvectors of the quadratic kernel recovered with MISE 
(circles), compared to the true values (black solid line). Note that quadratic filter eigenvectors are only determined up to a 
sign. 



First, the authors show that under a Gaussian stimulus ensemble, the expected log likelihood can be expressed in terms 
of the STA, STC, and the covariance matrix of the stimulus, and derive the closed-form expressions for maximum 
likelihood estimates of the quadratic kernel, linear filter, and the bias. Next, the generalization to arbitrary stimuli 
is achieved by numerically optimizing the true (as opposed to "expected") likelihood. In contrast to our suggestion 
of using the matrix basis expansion (which becomes an implicit regularizer upon choosing the dimensionality of 
the basis), Park and Pillow implement Bayesian regularization by imposing a prior on the quadratic kernel. This 
important suggestion is implemented as follows. 

The matrix is first decomposed into the eigensystem, Q = X^iLi a i' w % vf 'i i where w are not forced to have an L 



norm of 1, and cr,; = ±1 to indicate whether the filter i is excitatory or suppresive (as in STC (Schwartz et al. 



2006)). Then, a zero-mean Gaussian prior Mi^ 1 a. i 1 I) is put on each eigenvector w^, where the hyperparameter a 



determines the variance of the elements of eigenvector i; on — > oo corresponds to eliminating the direction i from 
the quadratic kernel and reducing its rank by 1. Next, an iterative algorithm is described for alternating between 
optimizing the likelihood with respect to model parameters, and optimizing the evidence given the parameters with 
respect to hyperparameters oli. This procedure correctly and efficiently identifies the rank of the quadratic kernels in 
synthetic examples, providing an automatic alternative for distinguishing "significant" from sampling-noise-induced 
eigenvectors in the STC and quadratic kernel inference. Finally, the authors show that Eq. ( 19 1 can be further 
generalized at no additional computational cost from the exponentiated quadratic function to a wider class of elliptic 
nonlinearities. 

To summarize, the reviewed work shows that the Bayesian generalization of STC and the generalization of GLMs 
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to quadratic stimulus dependence yield equal probabilistic models for neural encoding that can be efficiently inferred 
for a restricted class of nonlinear functions. Attention needs to paid, however, to maintain the convexity of the 
procedure and deal with the large number of parameters in the quadratic kernel. To this end, basis expansions as 
well as regularization with Bayesian priors seem like feasible candidates. 



IV. DISCUSSION 



While powerful conceptually, the notion that neurons respond to multiple projections of the stimulus onto orthogonal 
filters is difficult to turn into a tractable inference procedure when the number of filters is larger than a few. To address 
this concern, alternative encoding models have recently been proposed where the neuron can be sensitive to higher- 
order features in the stimulus. Instead of being described by multiple linear filters, the neuron's sensitivity is described 
by a single quadratic filter (and optionally an additional linear filter). We have reviewed several inference methods 
for such quadratic stimulus dependence: two based on information maximization and the other based on maximizing 
the likelihood in an extension of generalized linear models. With MISE, no assumptions are made about how the 
projection onto the quadratic filter combines with the linear filter projection, and how both map into the probability 
of spiking. This approach yields unbiased filter estimates under any stimulus ensemble, but requires optimization in a 
possibly rugged information landscape. Noise entropy maximization is a flexible, maximum-entropy based framework 
for modeling the probability of spiking given stimulus. It is computationally tractable and provides a convenient 
bound on the information per spike, but assumes a particular form of the nonlinearity. Alternatively, with a specific 
choice of nonlinearity and filter basis, likelihood inference within the GLM class can be extended to quadratic stimulus 
dependence while retaining the convexity of the objective function. By formulating the problem as Bayesian inference 
and choosing sparsifying priors for the quadratic filter, the true rank of the quadratic filter can also be inferred from 
data. 

All these approaches for inferring quadratic stimulus dependence are complementary; as we show in the appendix, 
maximum likelihood and information maximization inference also provide consistent filter estimates under defined 
conditions. A possible way to benefit from the tractability of likelihood formulations and maximization of noise 
entropy could be to use them to initialize a more general search using information maximization, in the hope that 
this would avoid the problems with the rugged information landscape, and remove the restrictions on the additive 
combination of linear and quadratic features. 

Examples of recent work establishing connections between higher-order structure of natural scenes and neural 
mechanisms beyond the sensory periphery (e.g. Karklin & Lewicki (20091; Tkacik et al. (2010)) make the development 
of corresponding methods for neural characterization, such as the ones presented here, very timely. Phenomena like 
phase invariance, adaptation to local contrast or sensitivity to signal envelope are widespread features of sensory 



neuron responses (IBaccus & Meister 2004 Hubel & Wiesel 1965 Touryan et al. 2002 1 . Moreover, as our abilities 



to record in vivo from the sensory systems of awake and behaving animals expand, so should the methods to analyze 
such recordings, where the relevant stimuli may no longer be perfectly controllable because of the animal's interaction 
with the environment (Kerr & Nimmerjahn 2012). The methods presented here will help us systematically elucidate 



sensitivity to higher-order statistical features from responses of sensory neurons to natural stimuli. 



Appendix: The relationship between information theoretic and likelihood-based inference 



We now demonstrate analytically that under rather general assumptions, the linear or quadratic filters obtained 
by maximizing mutual information match the filters inferred by maximizing the likelihood. We extend a reasoning 
we used previously in the context of inferring protein-DNA sequence-specific interactions in Kinney et al. (2007), to 



neural responses. See also Kouh & Sharpee (2009) and references therein for a similar demonstration. 

In the following, x remains the projection of the stimulus s onto the linear (x t = k ■ s t ) or quadratic (x t = s^Qs t ) 
filter, with time discretized in bins of duration A and indexed by subscript t. We collect all parameters that determine 
the filter into a vector 9± . Given a single Xt , yt spikes are generated according to a conditional probability distribution 
7r(y t |x t ). This probability distribution is typically assumed to be Poisson with mean given by f{xt) in the case of 
GLM, but we take a different approach. We discretize x t into x = 1, . . . ,K bins and parameterize n(y t \x t ), which is 
a ^niax x K matrix, by a set of parameters O2 ■ Apart from assuming a cutoff value for the number of spikes per bin 
y ma (which can always be chosen large enough to assign an arbitrarily low probability to observing > y max spikes in 
any real dataset) and a particular discretization of the projection variable x, we leave the probabilistic relationship 
Tr(y\x) between the projection and spike count completely unconstrained. The transformation from the stimulus to 
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the spikes is then a Markov chain, fully specified by 8 — {81,82}, 



k or Q 



->■ X t 



Vt- 



(20) 



The likelihood of the spike train {y t } given the stimulus s is P({y t }\s) = JJ t=1 n(y t \x t ), where T is the total number 
of time bins in the dataset. With x discretized into K bins, any dataset can be summarized by the count matrix 
c yx = Y^t=i ^(fi Ut)S(x, x t ), where 5 is the Kronecker delta; note that c yx = Tp(y, x), where p is simply the empirical 
distribution in the data of observing y spikes jointly with projection x. In terms of c, the likelihood of the observed 
spike train is P({y t }\s) — n^=cf Yl^=i 7r {y\x) Cyx ■ Assuming that x is adequately discretized and that ir is Poisson 
with mean f(x), we will recover the generalized likelihood of Eq ( |16[ ). 

Suppose that we are only interested in inferring the filter (parametrized by 8±), but not the filter-to-spike mapping 
7r (parameterized by 82)- While avoiding any assumptions about the structure of 7r, we can integrate the likelihood 
over O2 with some prior P p (02) such that 



P({yt}\s) 



Ml Pr, 



1 



(21) 



y.j: 



This resulting likelihood, called the model averaged likelihood, is now only a function of 8\. The prior P p (02) can take 
many forms, but since we discretized x, thereby making Tr(y\x) into a (conditional probability) matrix, the simplest 
choice for the prior is the uniform prior. In this case we set 82 equal to the entries in Tt(y\x) matrix and choose P{82) 
to be uniform over all valid matrices tt, such that the matrix entries are positive and the normalization constraint, 
J2x 7r (yl x ) = 1 f° r every x, is enforced. 



For any choice of priors we can rewrite Eq (21 ) as 



P{{Vt}\*) = J d8 2 P P (8 2 )exp 



T^2p(y,x)\ogTr(y\x) 



y,:r 



(22) 



which can be reorganized into 



P({yt}\s) = f d6 2 P p (6 2 )exp [T{l(y;x)-S(y) - (D KL (p(y\x) || 7r(y\x))) p(x) } 



(23) 



Here I(y; x) = ^2 y x p(y, x) log -^1^ is the empirical mutual information between spike counts y and the projection 

x, S(y) is the empirical spike count entropy, and the "correction" term in brackets measures the average Kullback- 
Leibler divergence (Dkl) between the empirical and model conditional distributions. Importantly, only this correction 
term is a function of the 7r and thus of 82, and is affected by the prior P p {82) which is being integrated over; the other 
terms can be pulled outside of the integral. We can therefore write the per time bin log likelihood as 



C = i log P({y t }\s) = I(y; x) - S(y) - A, 



(24) 



where the correction is 



A = -I log [ dd 2 P p (8 2 )e- T{DKL ^ x} 11 *(y\ x ))>?(*) . 



(25) 



It is necessary to show that as the amount of data T grows, the correction A decreases for a given choice of prior 
distribution P p (02), and for the choice of uniform prior this is analytically tractable (Kinney et al. 2007[ ). Intuitively, 



it is clear that as T — > 00, the empirical distribution p{y\x) converges to the true underlying distri 



DUtion p{y\x), and 



the integral becomes dominated by the extremal point 8\ , such that, in the saddle point approximation, 

A(T -4- 00) ~ (D KL (p(y\x) || -K*(y\x))) p{x) . 



(26) 



The distribution ir*(y\x) is the closest distribution to p(y\x) in the space over which the prior P p (82) is nonzero. As 
long as the prior assigns a non-zero probability to any (normalized) distribution, the divergence in A will decrease 
and A will vanish as T grows. The case in which A does not decay occurs when the prior completely excludes certain 
distributions by assigning zero probability, while the data p{y\x) precisely favors those excluded distributions. 
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Returning to the per time bin log likelihood C in Eq (24 1, as we decrease the time bin A, we enter a regime where 



there is only or 1 spike per bin, i.e., y € {0, 1}. Then the empirical information per time bin /(y; x) can be written 



I(y;x)=p(y = 0)D KL (p{x\y = 0)\\p(x)) +p(y = l)D KL (p(x\y = l)\\p(x)) , 



(27) 



that is, 



I(y; x) = p(silence)J si ie„cc + p(spike)/ sp iko- 



(28) 



If the information in the spike train is dominated by the information carried in spikes (Brenner et al. 2000), then 
the likelihood from Eq ( 24 1 becomes 



C = p(spike)i S p iko 



(29) 



where . . . are terms that either do not depend of the filter parameters 9± (i.e. entropy of the spike counts S(y)), or 
vanish as the size of dataset grows (A). 

The identity in Eq ( 29 ) is the sought-after connection between the inference using information maximization and the 
likelihood-based approach. In the limit of small time-bins, maximizing the information per spike / sp ik e ( m maximally 

of this paper), on right-hand side of the identity, 



informative approaches, as in (IS harp ee et al. 20041) and Section III. A 



is the same as maximizing the model averaged likelihood C of Eq ( 24 ) , on the left-hand side of the identity 
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